18.466, Dudley 
March 11, 2003 



CHAPTER 1. DECISION THEORY AND TESTING SIMPLE HYPOTHESES 

1.1 Deciding between two simple hypotheses: the Neyman-Pearson Lemma. 

Probability theory is reviewed in Appendix D. Suppose an experiment has a set $X$ of
possible outcomes. The outcome has some probability distribution $\mu$, defined on $X$. In
statistics, we typically don't know what $\mu$ is, but we have hypotheses about what it may be.
After making observations we'll try to make a decision between or among the hypotheses.
In general there could be infinitely many possibilities for $\mu$, but to begin with we're going
to look at the case where there are just two possibilities, $\mu = P$ or $\mu = Q$, and we need
to decide which it is. For example, a point $x$ in $X$ could give the outcome of a test for a
certain disease, where $P$ is the distribution of $x$ for those who don't have the disease and
$Q$ is the distribution for those who do.

Often, we have $n$ observations, independent with distribution $\mu$. Then $X$ can be
replaced by the set $X^n$ of all ordered $n$-tuples $(x_1, \ldots, x_n)$ of points of $X$, and $\mu$ by
the Cartesian product measure $\mu \times \cdots \times \mu$ of $n$ copies of $\mu$. In this way, the case of $n$
observations $x_1, \ldots, x_n$ reduces to that of one "observation" $(x_1, \ldots, x_n)$.

The probability measures $P$ and $Q$ are each defined on some $\sigma$-algebra $\mathcal{B}$ of subsets
of $X$, such as the Borel sets in case $X$ is the real line $\mathbb{R}$ or a Euclidean space. A test of the
hypothesis that $\mu = P$ will be given by a measurable set $A$, in other words a set $A$ in $\mathcal{B}$. If
we observe $x$ in $A$, then we will reject the hypothesis that $\mu = P$ in favor of the alternative
hypothesis that $\mu = Q$. Then $P(A)$ is called the size of the test $A$ (at $P$). The size is
the probability that we'll make the error of rejecting $P$ when it's true, i.e. when $\mu = P$,
sometimes called a Type I error. On the other hand, $Q(A)$ is called the power of the test
$A$ against the alternative $Q$. The power is the probability that when $Q$ is true, the test
correctly rejects $P$ and prefers $Q$. The complementary probability $1 - Q(A)$ is sometimes
called the probability of a Type II error. Given $P$ and $Q$, for the test $A$ to be as effective
as possible, we'd like the size to be small and the power to be large. In the rest of this
section, it will be shown how the choice of $A$ can be made optimally.

Example 1.1.1. Let $X = \mathbb{R}$ and let $P$ and $Q$ be normal measures, both with variance
0.04, $P = N(0, 0.04)$ and $Q = N(1, 0.04)$. Larger values of $x$ tend to favor $Q$, so it seems
reasonable to take $A$ as a half-line $[c, \infty)$ for some $c$. At $x = 1/2$, the densities of $P$ and
$Q$ are equal. For $x < 1/2$, $P$ has larger density. For $x > 1/2$, $Q$ does. So if we have no
reason in advance to prefer one of $P$ and $Q$, we might take $c = 1/2$. Then the probabilities
of the two types of errors are each about 0.0062 (from tables of the normal distribution).
In other words the size is 0.0062 and the power is 0.9938. If the variances had been larger,
so would the error probabilities.
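
The numbers in Example 1.1.1 can be checked numerically. The following is a minimal sketch in Python, using the standard error function in place of normal tables; the helper names `normal_cdf` and `half_line_test` are illustrative and not part of the text.

```python
import math

def normal_cdf(x, mean, var):
    """Distribution function of N(mean, var), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def half_line_test(c, mean_p=0.0, mean_q=1.0, var=0.04):
    """Size P(A) and power Q(A) of the test A = [c, oo) in Example 1.1.1."""
    size = 1.0 - normal_cdf(c, mean_p, var)
    power = 1.0 - normal_cdf(c, mean_q, var)
    return size, power

size, power = half_line_test(0.5)
print(round(size, 4), round(power, 4))
```

Moving `c` away from 1/2 trades one error probability against the other: a larger `c` lowers the size but also lowers the power.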

It's not always best to prefer the distribution ($P$ or $Q$) with larger density at the
observation (or vector of observations). In testing for a disease, an error indicating a
disease when the subject is actually healthy can lead to further, possibly expensive tests
or inappropriate treatments. On the other hand the error of overlooking a disease when
the patient has it could be much more serious, depending on the severity of the disease.

Numerical values called losses will be assigned to the consequences of mistaken decisions.
Let $L_{\mu\nu}$ be the loss incurred when $\mu$ is true and we decide in favor of $\nu$. A correct
decision will be assumed to cause zero loss, so $L_{PP} = L_{QQ} = 0$. The losses $L_{PQ}$ and $L_{QP}$
will be positive and in general will be different.

Also, the statistician may have assigned some probabilities to $P$ or $Q$ in advance,
called prior probabilities, say $\pi(P) = 1 - \pi(Q)$ with $0 < \pi(P) < 1$. For example it could
be known from other data (approximately) what fraction of people in a population being
tested have a disease. The part of statistics in which prior probabilities are assumed to
exist is known as Bayesian statistics, as contrasted with frequentist statistics where priors
are not assumed. In this book, both are treated. Later on, some pros and cons of the
Bayesian and frequentist approaches will be mentioned.

It will turn out that the best tests between $P$ and $Q$ will be based on the ratio of
densities of $P$ and $Q$, called the likelihood ratio, defined as follows. In general, $P$ or $Q$
could have continuous or discrete parts, but $P$ and $Q$ are always absolutely continuous
with respect to $P + Q$, so that there is a Radon-Nikodym derivative (RAP, 5.5.4) $h(x) =
(dP/d(P + Q))(x)$. Then $dQ/d(P + Q) = 1 - h$. The likelihood ratio $R_{Q/P}(x)$ of $Q$ to $P$ at
$x$ is defined as $(1 - h(x))/h(x)$, or $+\infty$ if $h(x) = 0$. The likelihood ratio, like $h$, is defined
up to equality $(P + Q)$-almost everywhere.

If $P$ and $Q$ have densities $f$ and $g$ respectively with respect to some measure, for
example Lebesgue measure on $\mathbb{R}$, then we can take $R_{Q/P}(x) = g(x)/f(x)$ if $f(x) > 0$, or
$+\infty$ when $g(x) > 0 = f(x)$, or $0$ when $g(x) = 0 = f(x)$. For a proof, see Appendix A.

In Example 1.1.1, $R_{Q/P}(x) = e^{25(x - 0.5)}$. Or, let $P$ and $Q$ both be Poisson distributions
on the set $\mathbb{N}$ of nonnegative integers with $P(k) = P_\lambda(k) = e^{-\lambda}\lambda^k/k!$ and $Q = P_\rho$ for some
$\rho$. Then $R_{Q/P}(k) = e^{\lambda - \rho}(\rho/\lambda)^k$ for all $k \in \mathbb{N}$.
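
Both closed forms can be compared with a direct ratio of densities. This is only a numerical sanity check, written as a sketch; the function names and the particular values $\lambda = 2$, $\rho = 3.5$ are chosen for illustration.

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def lr_normal(x):
    """R_{Q/P}(x) for P = N(0, 0.04), Q = N(1, 0.04), as a ratio of densities."""
    return normal_pdf(x, 1.0, 0.04) / normal_pdf(x, 0.0, 0.04)

def poisson_pmf(k, lam):
    """Poisson probability e^{-lam} lam^k / k!."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def lr_poisson(k, lam, rho):
    """R_{Q/P}(k) for P = Poisson(lam), Q = Poisson(rho)."""
    return poisson_pmf(k, rho) / poisson_pmf(k, lam)

for x in (0.2, 0.5, 0.9):
    print(x, lr_normal(x), math.exp(25.0 * (x - 0.5)))

lam, rho = 2.0, 3.5  # illustrative parameter values
for k in range(4):
    print(k, lr_poisson(k, lam, rho), math.exp(lam - rho) * (rho / lam) ** k)
```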

The sizes $\alpha = 0.05$, $0.01$ and $0.001$ were chosen rather arbitrarily in the first half of the
20th century and used in selecting tests. So, if a test $A$ has size $\alpha = 0.05$ or less at $P$, and
the observation $x$ is in $A$, the outcome is called "statistically significant" and the hypothesis
$P$ is rejected. If $\alpha < 0.001$ the outcome is called "highly significant." The levels 0.05 etc.
are still in wide use in some applied fields, such as medicine and psychology, although they
are no longer very popular among statisticians themselves. For discrete distributions, not
many sizes of tests may be available, as in the following:

Example 1.1.2. Let $X = \{0, 1, 2\}$, $P(0) = 0.8$, $P(1) = 0.05$, $P(2) = 0.15$, $Q(0) =
0.008$, $Q(1) = 0.002$, $Q(2) = 0.99$. Then $R_{Q/P}(x)$ is 0.01, 0.04, and 6.6 for $x = 0, 1, 2$,
respectively.

Example 1.1.2 suggests that, at least for discrete distributions, one should not insist on
conventional, specific sizes for tests. Another approach is to extend the definition of tests
as follows. A randomized test is a measurable function $f$ defined on $X$ with $0 \le f(x) \le 1$
for all $x$. In such a test, $Q$ is chosen with probability $f(x)$. Specifically, let $y$ be a random
variable uniformly distributed in $[0, 1]$ and independent of $x$. If $x$ is observed, then we
decide in favor of $Q$ if $y \le f(x)$ and $P$ otherwise. The size of the randomized test $f$
(at $P$) is defined as $\int f\,dP$ and the power of $f$ (at $Q$) as $\int f\,dQ$. Again, the size is the
probability of Type I error and the power is the probability of not making a Type II
error. If $f$ is the indicator function of a set $A$ then a randomized test reduces to a test as
previously defined.

In Example 1.1.2, the only non-empty non-randomized test of size $\le 0.05$ is $\{1\}$, but
this test has very low power 0.002. The empty set gives a test with size 0 but power also
0. A randomized test of size 0.05 is $f(0) = f(1) = 0$ and $f(2) = 1/3$, which has power
0.33, much better than 0.002.
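
The sizes and powers quoted for Example 1.1.2 can be tabulated directly. The sketch below represents a (randomized) test as a Python dictionary mapping each outcome $x$ to $f(x)$; this representation and the helper names are assumptions made for illustration.

```python
from itertools import combinations

P = {0: 0.8, 1: 0.05, 2: 0.15}
Q = {0: 0.008, 1: 0.002, 2: 0.99}

def size_and_power(f):
    """Size (integral of f dP) and power (integral of f dQ) of a randomized test f."""
    size = sum(f[x] * P[x] for x in P)
    power = sum(f[x] * Q[x] for x in Q)
    return size, power

def indicator(A):
    """Non-randomized test given by the set A, as a function with values 0 or 1."""
    return {x: (1.0 if x in A else 0.0) for x in P}

# All eight non-randomized tests on X = {0, 1, 2}.
for r in range(4):
    for A in combinations(sorted(P), r):
        print(set(A) or "{}", size_and_power(indicator(A)))

# The randomized test f(0) = f(1) = 0, f(2) = 1/3 from the text.
randomized = {0: 0.0, 1: 0.0, 2: 1.0 / 3.0}
print(size_and_power(randomized))
```

Among the non-randomized tests, only the empty set and $\{1\}$ have size at most 0.05, which is the scarcity of available sizes that the example illustrates.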

A (randomized) test $f$ of $P$ vs. $Q$ will be said to dominate another test $g$, or to be as
good as $g$, if both $\int f\,dP \le \int g\,dP$ and $\int f\,dQ \ge \int g\,dQ$. If in addition at least one of the
two inequalities is strict, $\int f\,dP < \int g\,dP$ or $\int f\,dQ > \int g\,dQ$, then $f$ is said to dominate
$g$ strictly, or to be better than $g$, or to improve $g$. If a test $f$ dominates a test $g$ then $f$ is
as good as $g$ or better, both as to size (at $P$) and power (at $Q$).

If $P$ is true, then the average or expected loss in using the test $f$, called the risk $r(f, P)$,
is $L_{PQ} \int f\,dP$. If $Q$ is true, the average loss in using $f$ is the risk $r(f, Q) = L_{QP} \int (1 - f)\,dQ$.
Note that when $P$ and $Q$ are interchanged, so are $f$ and $1 - f$.

For any prior $\pi$, the overall average risk for $f$, called the Bayes risk, is

$$r(f) = \pi(P)\,r(f, P) + \pi(Q)\,r(f, Q).$$

If $f$ dominates $g$, then $r(f) \le r(g)$, and if the domination is strict, then $r(f) < r(g)$. So,
although the notion of (strict) domination is defined without reference to losses or priors,
the notion that a test $f$ which dominates $g$ is "better" than $g$ does hold in terms of losses
and priors.
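
As an illustration of these definitions, the following sketch computes the Bayes risk and checks domination for two tests from Example 1.1.2. The prior and losses used here ($\pi(P) = 1/2$, $L_{PQ} = L_{QP} = 1$) are illustrative assumptions, and a small tolerance `eps` is added only to absorb floating-point rounding.

```python
P = {0: 0.8, 1: 0.05, 2: 0.15}
Q = {0: 0.008, 1: 0.002, 2: 0.99}

def size(f):
    return sum(f[x] * P[x] for x in P)

def power(f):
    return sum(f[x] * Q[x] for x in Q)

def bayes_risk(f, prior_p, loss_pq, loss_qp):
    """r(f) = pi(P) * L_PQ * size + pi(Q) * L_QP * (1 - power)."""
    risk_p = loss_pq * size(f)            # r(f, P)
    risk_q = loss_qp * (1.0 - power(f))   # r(f, Q)
    return prior_p * risk_p + (1.0 - prior_p) * risk_q

def dominates(f, g, eps=1e-12):
    """True if f has size <= that of g and power >= that of g (up to eps)."""
    return size(f) <= size(g) + eps and power(f) >= power(g) - eps

def dominates_strictly(f, g, eps=1e-12):
    strict = size(f) < size(g) - eps or power(f) > power(g) + eps
    return dominates(f, g, eps) and strict

f = {0: 0.0, 1: 0.0, 2: 1.0 / 3.0}   # randomized test of size 0.05
g = {0: 0.0, 1: 1.0, 2: 0.0}         # non-randomized test {1}, also of size 0.05

print(dominates_strictly(f, g))
print(bayes_risk(f, 0.5, 1.0, 1.0), bayes_risk(g, 0.5, 1.0, 1.0))
```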

Randomized tests are very rarely used in applications. An experimental scientist would
not want to make a decision based on the random variable $y$ extraneous to (independent of)
the actual experiment, if $0 < f(x) < 1$. But, randomized tests provide a way to formulate
the merits of tests based on the likelihood ratio while avoiding the pitfalls indicated by
Example 1.1.2. It turns out that for the best tests, randomization ($0 < f(x) < 1$) only
occurs when the likelihood ratio $R_{Q/P}(x)$ has one value $c$, depending on the test (Theorem
1.1.3 below). Also, in the Bayes case, what happens when $R_{Q/P} = c$ doesn't affect the
Bayes risk (Remark 1.1.9).

A randomized test $g$ will be called inadmissible if some other randomized test $f$ strictly
dominates $g$. If there is no such $f$, then $g$ is called admissible. The following main theorem
characterizes admissible tests of $P$ vs. $Q$.

1.1.3 Theorem (Neyman-Pearson Lemma). For any two different probability measures
$P$ and $Q$ on $(X, \mathcal{B})$, a randomized test $f$ of $P$ vs. $Q$ is admissible if and only if

(1.1.4) there is some $c$ with $0 \le c \le \infty$ such that $f(x) = 1$ for $(P + Q)$-almost
all $x$ satisfying $R_{Q/P}(x) > c$ or $R_{Q/P}(x) = +\infty$, and $f(x) = 0$
for $(P + Q)$-almost all $x$ satisfying $R_{Q/P}(x) < c$ or $R_{Q/P}(x) = 0$.

Every randomized test $g$ is dominated by a randomized test $f$ satisfying (1.1.4) where
for some measurable function $h$, $f(x) = h(R_{Q/P}(x))$ with $0 \le h(y) \le 1$ for all $y$.

From Theorem 1.1.3, it follows directly that: 

1.1.5 Corollary. For any $P \ne Q$ on $(X, \mathcal{B})$, the non-randomized tests $\{R_{Q/P}(x) > b\}$
and $\{R_{Q/P}(x) \ge c\}$ are admissible whenever $0 \le b < \infty$ and $0 < c \le \infty$.

Proof of Theorem 1.1.3. For any $c$ with $0 \le c \le +\infty$ and $\gamma$ with $0 \le \gamma \le 1$, let
$f_{c,\gamma}(x) := h_{c,\gamma}(R_{Q/P}(x))$ where $h_{c,\gamma}(y) = 1$ for $y > c$, $\gamma$ for $y = c$, and $0$ for $y < c$. Note
that each such $f_{c,\gamma}$ is a randomized test. Then we have

1.1.6 Lemma. For any $P \ne Q$ on $X$ and $0 \le \alpha \le 1$, there is some $f_{c,\gamma}$ with size $\alpha$.

Proof. If $\alpha = 0$ take $c := c(0) := +\infty$ and $\gamma := \gamma(0) := 1$. (Any $\gamma$ would give
size 0, but $\gamma = 1$ will be useful later for admissibility.) If $\alpha = 1$ let $c := c(1) := 0$ and
$\gamma := \gamma(1) := 1$. (The latter is needed to get size 1 if $P(R_{Q/P} = 0) > 0$, although then
the test won't be admissible.)

If $0 < \alpha < 1$ let $c := \inf\{t : P(R_{Q/P} > t) \le \alpha\}$. Then $0 \le c < +\infty$, $P(R_{Q/P} >
c) \le \alpha$ and $P(R_{Q/P} > t) > \alpha$ for any $t < c$, so $P(R_{Q/P} \ge c) \ge \alpha$. If $P(R_{Q/P} = c) = 0$,
any $\gamma$ will work, say $\gamma = 1$. Otherwise let

$$\gamma := \gamma(\alpha) := (\alpha - P(R_{Q/P} > c))/P(R_{Q/P} = c),$$

noting that

$$0 \le \alpha - P(R_{Q/P} > c) \le P(R_{Q/P} \ge c) - P(R_{Q/P} > c) = P(R_{Q/P} = c),$$

so $0 \le \gamma \le 1$. In each case, $f_{c,\gamma}$ has size $\alpha$. □
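
For a finite sample space the construction in this proof can be carried out mechanically. The sketch below is one way to do so, under the assumption that $P$ and $Q$ are given as dictionaries of point masses; the names `likelihood_ratio`, `cutoff` and `size` are hypothetical. Exact comparisons are used, so inputs near a floating-point tie may need a tolerance.

```python
import math

def likelihood_ratio(P, Q, x):
    """R_{Q/P}(x) = Q(x)/P(x), with +inf where P(x) = 0 < Q(x)."""
    if P[x] > 0:
        return Q[x] / P[x]
    return math.inf if Q[x] > 0 else 0.0  # value on a (P+Q)-null point is arbitrary

def cutoff(P, Q, alpha):
    """Return (c, gamma) as in the proof of Lemma 1.1.6, for a finite sample space."""
    if alpha == 0:
        return math.inf, 1.0
    if alpha == 1:
        return 0.0, 1.0

    def tail(t):
        # P(R_{Q/P} > t)
        return sum(P[x] for x in P if likelihood_ratio(P, Q, x) > t)

    # t -> P(R > t) is a right-continuous step function, so the infimum is
    # attained at 0 or at a ratio value carrying positive P-mass.
    candidates = {0.0} | {likelihood_ratio(P, Q, x) for x in P if P[x] > 0}
    c = min(t for t in candidates if tail(t) <= alpha)
    mass_at_c = sum(P[x] for x in P if likelihood_ratio(P, Q, x) == c)
    gamma = 1.0 if mass_at_c == 0 else (alpha - tail(c)) / mass_at_c
    return c, gamma

def size(P, Q, c, gamma):
    """Size of f_{c,gamma}: P(R > c) + gamma * P(R = c)."""
    total = 0.0
    for x in P:
        r = likelihood_ratio(P, Q, x)
        total += P[x] * (1.0 if r > c else gamma if r == c else 0.0)
    return total

# Example 1.1.2 with alpha = 0.05.
P = {0: 0.8, 1: 0.05, 2: 0.15}
Q = {0: 0.008, 1: 0.002, 2: 0.99}
c, gamma = cutoff(P, Q, 0.05)
print(c, gamma, size(P, Q, c, gamma))
```

For Example 1.1.2 with $\alpha = 0.05$ this gives $c$ equal to the largest ratio and $\gamma$ close to $1/3$, which is the randomized test $f(0) = f(1) = 0$, $f(2) = 1/3$ considered earlier.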

Let $c := c(\alpha) := c(\alpha, P, Q)$ as defined in the last proof. Let $g$ be any randomized
test and let $G$ be the randomized test which equals $1$ if $R_{Q/P} = +\infty$, $g$ if $0 < R_{Q/P} < \infty$
and $0$ if $R_{Q/P} = 0$. Any change from $g < 1$ to $G = 1$ on the set where $R_{Q/P} = +\infty$
can only increase the power of the test, without increasing the size since $P = 0$ on that
set. Likewise, any change where $R_{Q/P} = 0$ from $g > 0$ to $G = 0$ can only decrease the
size without loss of power. So $G$ dominates $g$, strictly unless $g = G$ almost everywhere for
$P + Q$.

1.1.7 Lemma. Let $g$ be any randomized test of $P$ vs. $Q$. Let its size be $\alpha$. Then $f := f_{c,\gamma}$
dominates $g$ for $c := c(\alpha)$ and $\gamma = \gamma(\alpha)$.

Proof. The two tests both have size $\alpha$, so it needs to be shown that $f$ has at least as
large power as $g$. If $c = +\infty$, then $\alpha = 0$ and $\gamma = 1$. So $g = 0$ almost everywhere for $P$,
and so almost everywhere for $Q$ on the set where $R_{Q/P} < \infty$. So $g \le f$ almost everywhere
for $P + Q$, and $f$ does dominate $g$. So we can assume $c < \infty$. Now

$$\{f > g\} \subset \{f > 0\} \subset \{R_{Q/P} \ge c\} \quad\text{and}\quad \{f < g\} \subset \{f < 1\} \subset \{R_{Q/P} \le c\},$$

so $J(x) := (f - g)(x) \cdot (R_{Q/P}(x) - c) \ge 0$ for all $x$, where (in this case) $0 \cdot \infty$ is taken
to be $0$. Let $R := R_{Q/P}$. Then

$$0 \le \int_{R<\infty} J\,dP = \int_{R<\infty} (f - g)(dQ - c\,dP), \quad\text{so}\quad \int_{R<\infty} (f - g)\,dQ \ge c \int_{R<\infty} (f - g)\,dP.$$

On the set where $R = +\infty$, $f = 1 \ge g$, so

$$\int_{R=\infty} (f - g)\,dQ \ge 0 = c \int_{R=\infty} (f - g)\,dP \quad\text{and}\quad \int (f - g)\,dQ \ge c \int (f - g)\,dP = 0$$

since the two tests have the same size. So $\int f\,dQ \ge \int g\,dQ$. □

Now, to continue the proof of Theorem 1.1.3, let $J(\cdot)$ be defined as in the last proof.
If $P(J > 0) > 0$, then since $P(R = \infty) = 0$, $f$ would have strictly larger power than $g$ and
$g$ would be inadmissible. So if $g$ is admissible, then $g = f$ almost everywhere for $P$ on the
set where $R \ne c$. As shown above, $g = G = 0$ for $P$- and so $(P + Q)$-almost all $x$ such that
$R_{Q/P}(x) = 0$, and $g = G = 1$ for $Q$- and so $(P + Q)$-almost all $x$ such that $R_{Q/P}(x) = +\infty$.
It follows that admissible randomized tests all satisfy (1.1.4), while Lemma 1.1.7 proved
the last statement in the Theorem. What is left is to show that any randomized test $f$
satisfying (1.1.4) for some $c$ is admissible (whether or not $f$ is of the form $f_{c,\gamma}$).

If $c = +\infty$, then the size of $f$ is $0$ and by (1.1.4), $f = f_{\infty,1}$ almost everywhere for $P + Q$.
By the proof of Lemma 1.1.7, $f$ dominates any other test of size 0, so $f$ is admissible. If
$c = 0$, then $f$ has power 1, and by (1.1.4), $f = 1_{\{R>0\}}$ almost everywhere for $P + Q$. A test $g$
of smaller size must have smaller integral for $P$ over $\{R > 0\}$, so $\mu(\{g < 1\} \cap \{R > 0\}) > 0$
for $\mu = P$ and thus for $\mu = Q$, so the power of $g$ is less than 1. Thus $g$ doesn't dominate
$f$ and $f$ is admissible. So we can assume $0 < c < \infty$.

If $f$ is not admissible, let $F$ be a randomized test which dominates $f$ strictly. Let $\alpha$ be
the size of $F$ and let $b = c(\alpha)$, $\gamma = \gamma(\alpha)$. So by Lemma 1.1.7, $f_{b,\gamma}$ is a test of size $\alpha$ which
dominates $F$ and $f$. If $P(R = c) > 0$, let $z := E_P(f \mid R = c) = \int_{\{R=c\}} f\,dP / P(R = c)$.
Then also $z = E_Q(f \mid R = c)$ since on $\{R = c\}$, $Q = cP$. Otherwise let $z = 0$. Then $f$ has
the same size and power as $f_{c,z}$.

Say $f \le g$ if $f(x) \le g(x)$ for all $x$. Then the set of all functions $f_{a,t}$ is linearly ordered
since $f_{a,t} \le f_{d,u}$ if $d < a$ or if $d = a$ and $t \le u$. If $f_{b,\gamma} \le f_{c,z}$, then the power of $f_{b,\gamma}$ must
equal that of $f_{c,z}$ since it is larger or equal. So $f_{b,\gamma} = f_{c,z}$ almost everywhere for $Q$, and
so also for $P$ since $b \ge c > 0$. On the other hand if $f_{c,z} \le f_{b,\gamma}$ then the sizes of these two
tests must be the same, so the functions must be equal almost everywhere for $P$, and so
also for $Q$ since $b \le c < \infty$. In either case, the randomized tests $f_{b,\gamma}$ and $f_{c,z}$ are equal
almost everywhere for $P + Q$, contradicting the assumption that $F$ and so $f_{b,\gamma}$ dominated
$f$ and so $f_{c,z}$ strictly, proving Theorem 1.1.3. □

A randomized test $f$ of $P$ vs. $Q$ will be called a Bayes test for a given prior $\pi$ and
losses $L_{PQ}$, $L_{QP}$ if $f$ minimizes the Bayes risk $r(f)$. Such tests are of the Neyman-Pearson
form with $c$ as follows:

1.1.8 Theorem. A randomized test $f$ is Bayes for given losses $L_{PQ} > 0 < L_{QP}$ and prior
$\pi$ with $0 < \pi(P) = 1 - \pi(Q) < 1$ if and only if $f$ satisfies (1.1.4) for

$$c := c_{PQ} := c_{\pi,L} := \pi(P)L_{PQ}/(\pi(Q)L_{QP}).$$

Proof. It's easily seen that if some $F$ strictly dominates $f$, then $F$ has smaller Bayes risk
than $f$. So a Bayes test $f$ must be admissible, and so satisfy (1.1.4), which implies that
$f = 0$ almost surely for $P$ where $R = 0$ and $f = 1$ almost surely for $Q$ where $R = \infty$.
Assuming $f$ is Bayes, we need to find it on the set $C := \{0 < R < \infty\}$. A Bayes test $f$
must minimize

$$\int_C \bigl[\pi(P)L_{PQ}f(x) + (1 - \pi(P))L_{QP}(1 - f(x))R(x)\bigr]\,dP(x)$$

or equivalently, minimize

$$\int_C f(x)\bigl\{\pi(P)L_{PQ} - \pi(Q)L_{QP}R(x)\bigr\}\,dP(x)$$

where $0 \le f(x) \le 1$. So we must have $f(x) = 0$ if $R(x) < c$ and $f(x) = 1$ if $R(x) > c$. (The
behavior of $f$ when $R = c$ is unrestricted.) So $f$ satisfies (1.1.4) for $c = c_{PQ}$. The same
calculation shows that conversely, any randomized test $f$ satisfying (1.1.4) for $c = c_{PQ}$ is
Bayes. □

1.1.9 Remark. It follows from Theorem 1.1.8 that under the given conditions there
is no need for randomization to get a Bayes test for any given prior and losses: both
$\{R_{Q/P} > c_{PQ}\}$ and $\{R_{Q/P} \ge c_{PQ}\}$ give non-randomized Bayes tests.

In Example 1.1.2, the following non-randomized tests are admissible, and are Bayes
for the given values of $c_{\pi,L} = c_{PQ}$:

$\{0, 1, 2\}$ (always choose $Q$) is Bayes for $0 < c_{\pi,L} \le 0.01$;
$\{1, 2\}$ (choose $P$ only if $x = 0$) is Bayes for $0.01 \le c_{\pi,L} \le 0.04$;
$\{2\}$ (choose $Q$ only if $x = 2$) is Bayes for $0.04 \le c_{\pi,L} \le 6.6$;
The empty set (always choose $P$) is Bayes for $6.6 \le c_{\pi,L} < \infty$.

Thus, $\{2\}$ is the preferred test if the losses $L_{PQ}$ and $L_{QP}$ are not too different and
neither are the priors $\pi(P)$ and $\pi(Q)$. If $c_{\pi,L}$ is $< 0.01$ or $> 6.6$ then we need more
than one observation to have an actual choice between $P$ and $Q$. If $c > 6.6$ and most of
many observations equal 2 then we can choose $Q$. Likewise, if $c < 0.01$ and most of many
observations equal 0 then we can choose $P$.
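
The list of Bayes tests for Example 1.1.2 can be checked by brute force over the eight non-randomized tests. The sketch below does this for a few illustrative priors and losses (the specific values are assumptions chosen to land in each range of $c_{\pi,L}$); the function names are hypothetical.

```python
from itertools import combinations

P = {0: 0.8, 1: 0.05, 2: 0.15}
Q = {0: 0.008, 1: 0.002, 2: 0.99}

def threshold(prior_p, loss_pq, loss_qp):
    """c_{pi,L} = pi(P) L_PQ / (pi(Q) L_QP) from Theorem 1.1.8."""
    return prior_p * loss_pq / ((1.0 - prior_p) * loss_qp)

def bayes_risk(A, prior_p, loss_pq, loss_qp):
    """Bayes risk of the non-randomized test that chooses Q exactly on the set A."""
    size = sum(P[x] for x in A)
    power = sum(Q[x] for x in A)
    return prior_p * loss_pq * size + (1.0 - prior_p) * loss_qp * (1.0 - power)

def best_nonrandomized_test(prior_p, loss_pq, loss_qp):
    """Non-randomized test with the smallest Bayes risk, found by enumeration."""
    tests = [frozenset(A) for r in range(4) for A in combinations(sorted(P), r)]
    return min(tests, key=lambda A: bayes_risk(A, prior_p, loss_pq, loss_qp))

# (prior pi(P), L_PQ, L_QP): illustrative settings, one per range of c.
for setting in [(0.5, 1.0, 200.0), (0.5, 1.0, 50.0), (0.5, 1.0, 1.0), (0.5, 10.0, 1.0)]:
    print(threshold(*setting), sorted(best_nonrandomized_test(*setting)))
```

By Remark 1.1.9 a non-randomized Bayes test always exists, so restricting the search to sets loses nothing here. At the boundary values $c = 0.01$, $0.04$, $6.6$ two tests tie, and the enumeration simply returns one of them.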

PROBLEMS

1. Suppose we have two independent observations $X_1, X_2$ in $\mathbb{R}$ with distribution $\mu$ where
either $\mu = P = N(0, 1)$ or $\mu = Q = N(0, 4)$. In $\mathbb{R}^2$, find a region giving an admissible
test of $P \times P$ vs. $Q \times Q$ for $c = 1$. What are the size and power of this test? Hint: use
polar coordinates.

2. Let $P$ have density $e^{-x}$ on $[0, \infty)$ (so $P$ is a standard exponential distribution) and let
$Q$ be the distribution of $X + 1$ where $X$ has distribution $P$.

(a) What is the maximum power of a test of $P$ vs. $Q$ with size $\alpha \le 0.05$?

(b) Answer the same question if we have $n$ independent observations each with
distribution $\mu = P$ or $Q$, so that $P$, $Q$ are replaced by $P^n$, $Q^n$.

3. In Example 1.1.2, if $0 < \pi(P) = 1 - \pi(Q) < 1$, $0 < L_{PQ} < \infty$, and $0 < L_{QP} < \infty$,

(a) Under what conditions on $\pi(P)$, $L_{PQ}$, and $L_{QP}$ is $\{0\}$ a better test of $P$ vs. $Q$
than $\{2\}$ is?

(For a test set $A$, we choose $Q$ and reject $P$ if the observation is in $A$, otherwise we
choose [don't reject] $P$.)

(b) Under the conditions found in (a), what is the Bayes test or decision rule (i.e. the
rule with minimum risk) for deciding between $P$ and $Q$ based on one observation in
the given sample space $\{0, 1, 2\}$?

4. Prove or disprove: let the set $A$ define an admissible, non-randomized test of $P$ vs. $Q$.
Then for any other set $B$ also defining such a test, and of the same size as for $A$, the
two sets $A$ and $B$ differ only by a set of measure 0 for $P + Q$.

5. A statistic $x$ is used in testing for a disease $D$. Let $P = N(2, 0.25)$ be the distribution of
$x$ for those without $D$ and $Q = N(5, 1)$ be the distribution for those with $D$. Suppose
that the losses are $L_{QP} = \$600{,}000$ and $L_{PQ} = \$500$. It is known that about 1 percent
of the population being tested has $D$. Find a test which minimizes the Bayes risk. Hint:
the answer is a union of two half-lines. One is negligible. Why?

NOTES TO SECTION 1.1 

The original publication of the Neyman-Pearson Lemma was by Neyman and Pearson 
(1933). Egon Pearson was the son of Karl Pearson, an even more prominent statistician, 
inventor of the chi-squared test. The proof given above is based on Lehmann (1986, Sec. 
3.2). Theorem 1.1.8 appears in Cox and Hinkley (1974, p. 419). 

Ferguson (1967) uses the terms that $f$ is "as good as" or "better than" $g$. Bickel
and Doksum (1977) said $f$ "improves" $g$; in their second edition (2001) I didn't find that
terminology or any replacement for it.